Abstract



We present a technology mapper for full-custom ECL gates. These gates are characterized 
by high fanins and a regular structure. Full-custom gates differ from ECL library gates in 
that a full range of structures is available as a single form, rather than a large number of 
individual gates that sparsely cover the possible design space. 

This paper presents a complete boolean matching algorithm and gives a proof of its 
correctness. We show that it can efficiently map logic into the general ECL gate form. We 
also show two variants of the algorithm, and show that they give poorer results with no 
savings in runtime. 

The mapper described in the paper is a necessary component of a CAD system for 
designing ECL microprocessors. Manual design of full-custom ECL gates would not be 
acceptable for control logic since it is a tedious, error prone, and lengthy activity. Nor 
would a gate-array style mapper and library with a limited number of gates be acceptable, 
because this makes less effective use of the inherent speed of the technology. 
1 Introduction



This paper presents a specialized form of boolean function mapping that is efficient for full-custom 
designs in the ECL (Emitter Coupled Logic) circuit family. By full-custom we mean the gates are 
not selected from a library, but are instead built as needed within the bounds of what the technology 
allows. This allows us to have many more gates than could be placed in a library. Existing libraries[7] 
contain only a portion of the possible gates, resulting in both wasted area and time. 

Our current application is a full-custom 64 bit ECL BiCMOS microprocessor. The characteristics 
of ECL important to this application are: complex gates with wide fanins, free negation of gate 
outputs, low gate delays, low wiring delays (due to low logic swings and high currents), and 
a density comparable to CMOS for structures other than RAM. The combination of these factors 
allows implementation of fast microprocessors where power consumption is not important and where 
CMOS RAM may be implemented elsewhere on the same chip, as is the case in our application. The 
advantages of ECL have been demonstrated with an experimental 300 MHz, 115 W, 32-bit full-custom
ECL microprocessor called BIPS0 [4].

The technology mapper described in this paper takes advantage of two particular features of 
ECL: high fanin and a regular gate structure. The mapper is an essential part of the CAD system[6] 
we are using to design our next generation ECL microprocessor. 

2 ECL Gates 

ECL is a current-steering technology. That is, a current source provides a fixed amount of current, 
which is then routed in one of two directions using differential pairs of transistors (Figure 1). 
The differential pair works by comparing the voltages on the bases (inputs) of the transistors, and 
routing the current through the transistor that has the highest base voltage. In addition to voltages 
corresponding to logic values 1 and 0, a voltage is available that corresponds to the logic value 0.5. 
This allows us to route current with only a single input, rather than requiring two inputs that are 
complements of each other. This voltage is called the reference voltage, or Vr. 

Legal circuits will never split current between paths, except for the special OR configuration 
where several transistors reconverge the currents immediately. Figure 2 shows an example gate 
where currents can split between the OR configured transistors connected to $i_0$ and $i_1$, but reconverge
immediately. 

Current may be routed through more than one level of differential pairs and then finally to a 
resistor. The presence or absence of current through this resistor determines the voltage across 
it. This voltage is propagated to the output of the gate using a driver called an Emitter Follower. 
Figure 2 shows a gate that implements the function $F = (i_0 + i_1) i_2$. If $i_0$ and $i_1$ are 0, current is
routed from point A to point D, which is pulled low due to the voltage across the resistor. Output
$O$ therefore goes low. There is no current through the other resistor, so $\bar{O}$ is pulled high. If either
$i_0$ or $i_1$ is 1, then current is routed instead from point A to point B. If $i_2$ is 1, the current will go to
D. Otherwise, it will go to C, pulling it low and setting $\bar{O}$ low. If $\bar{O}$ is low, then there is no current
through D so it is high and $O$ is high.

Figure 1: Current Steering

In the most general case, ECL allows n-way current steering, not just 2-way current steering. 
Current can never be split between paths except in the OR configuration. Thus, the designer must 
ensure that no two inputs in the n-way comparison are high at the same time, or the circuit will
malfunction. In addition to n-way splitting, ECL allows more than two resistors, corresponding to 
more than two outputs. However, only one output can be low at any given time due to the fact that 
current can only be routed through one resistor at a time. We have chosen to restrict ourselves to 
2-way current steering with two resistors. This allows us to use single inputs (or ORs of inputs) that 
are compared to a reference voltage, avoiding the problem of ensuring mutual exclusion. We also 
choose to only have two outputs per gate, the true and complement values of the function. 

ECL families allow the current to pass through a certain number of levels of differential pairs 
before it gets to a resistor. Each level has a voltage drop associated with it, so the power supply 
voltage for the chip determines the maximum number of levels that will fit. In our technology we 
normally use two levels, although a third level is allowed for situations where no reference voltage 
is needed. To keep things simple, only two levels are used for our automatically generated circuits. 

ECL families are also characterized by the maximum fanin of the OR terms. This number 
is determined by the noise margins, including factors such as IR drops and variable transistor 
characteristics. In our family the OR fanin limit is 10. Coupled with our 2-way current steering 
and a maximum of two levels of steering, the maximum fanin for a single ECL gate is 30. Figure 3 
shows an ECL gate with a fanin of 30. All other gates we may wish to use can be constructed from
this gate by deleting inputs and connecting up the resistors differently. The top of each current path
can be connected to either of the resistors (but not both). This gives us much flexibility in terms of
the logic functions we can implement. When two current paths are connected to the same resistor, it
effectively creates an OR of those paths.

Figure 2: Gate for $F = (i_0 + i_1) i_2$

We can describe this general gate using boolean logic. There are two sorts of parameters for this
description. Phase constants, denoted $\phi$, select among various wiring patterns and circuit forms.
These cannot be changed once the gate is implemented. Input variables, denoted $x$, are inputs to
the gate and of course change during gate operation. The general circuit form is:

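$$F(x_0, \ldots, x_{29}) = \mathit{mux}(S_x,\ \phi_y \oplus S_y,\ \phi_z \oplus S_z)$$

where

$$S_x = x_0 + \cdots + x_9$$
$$S_y = x_{10} + \cdots + x_{19}$$
$$S_z = x_{20} + \cdots + x_{29}$$

and

$$\mathit{mux}(a, b, c) = a \cdot b + \bar{a} \cdot c$$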

The $\phi$ constants are used to select the true or complement of each secondary OR term by choosing
which resistors the current paths are connected to. Although not represented in this equation, $\bar{F}$ is
also available for the cost of an output driver.
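The general form can be sketched as a small evaluator. This is only an illustration of the equation above, not the authors' tooling: the function names `ecl_gate` and `figure2_style` are hypothetical, a deleted input is modeled as a constant 0 (an assumption), and the placement of $i_0$, $i_1$, $i_2$ on particular input slots is one possible choice.

```python
def mux(a, b, c):
    """mux(a, b, c) = a*b + (not a)*c."""
    return (a and b) or ((not a) and c)


def ecl_gate(x, phi_y, phi_z):
    """Evaluate F = mux(S_x, phi_y XOR S_y, phi_z XOR S_z) for 30 input bits.

    x[0:10] feed S_x, x[10:20] feed S_y, x[20:30] feed S_z.
    phi_y and phi_z are the phase constants fixed when the gate is built.
    """
    if len(x) != 30:
        raise ValueError("the general gate form has exactly 30 inputs")
    s_x = any(x[0:10])
    s_y = any(x[10:20])
    s_z = any(x[20:30])
    # bool != bool is XOR on truth values
    return bool(mux(s_x, bool(phi_y) != s_y, bool(phi_z) != s_z))


def figure2_style(i0, i1, i2):
    """(i0 + i1) * i2 obtained from the general form.

    Assumption: unused inputs are tied to 0, which models deleting them.
    """
    x = [0] * 30
    x[0], x[1] = i0, i1      # OR term selecting between the two branches
    x[10] = i2               # single-literal S_y
    return ecl_gate(x, phi_y=0, phi_z=0)   # S_z is empty, so F_Z = 0
```

With $\phi_y = \phi_z = 0$ and an empty $S_z$, the mux reduces to $S_x \cdot S_y$, which is why three populated inputs are enough for this function.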
Figure 3: 30-input ECL gate

3 Previous Work 

Technology mapping takes a network and maps it into a gate netlist. Part of the process carves out 
subnetworks and tries to map them to a gate. Two main approaches are used to do this mapping. 
Tree matching[3, 5] has been traditional, but current work is concentrated in the area of boolean 
matching[2]. Boolean matching looks not at the shape of a subnetwork, but rather at the logic 
function it implements. A gate is chosen based on this logic function, or failure is reported if no 
single gate can implement the function. The technology mapping algorithm repeatedly asks the 
boolean matcher to find possible covers of subnetworks, and uses these results to select a good cover 
for the entire network. 

Another system[7] does technology mapping for ECL. That system, however, uses gates in a 
library rather than a general circuit form, restricting the quality of the output. We are unaware of any 
system that uses a boolean matching approach to map full-custom ECL gates. 

In our technology there is one fully-populated gate from which all other gates can be derived
by bridging and/or deleting inputs. Although previous work[2] can handle the deletion of inputs 
to create matches, that is not the right approach for us since we only have one gate form, and the 
number of possible bridges or deletions is large. Instead, we try to reshape the function in hand to 
see if we can put it into our general gate form. This is much faster for two reasons: the complexity
of our algorithm is less than other boolean matching algorithms, and our algorithm matches against 
a single circuit form rather than requiring a large number of matches against different circuit forms. 

4 Matching Algorithm 

In this section, we describe an efficient algorithm to check whether a Boolean function $F$ can be
decomposed into our full-custom ECL circuit form. We assume that all Boolean function manipulations
are performed using Binary Decision Diagrams (BDDs) [1].

4.1 Notation 

Let $x_1, \ldots, x_n$ be $n$ Boolean variables. Let $X$ be a set of literals, i.e. a subset of $\{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$.
In what follows we suppose that a set of literals never contains a variable and its negation.

• $s_X = \sum_{x \in X} x$ denotes the disjunction of all the literals contained in $X$. Its negation, $\bar{s}_X$,
denotes the cube $\prod_{x \in X} \bar{x}$.

• $\mathit{var}(X)$ denotes the set of variables appearing in $X$. For example if $X = \{x_1, \bar{x}_3, x_5\}$ then
$\mathit{var}(X) = \{x_1, x_3, x_5\}$.

• By extension, if $x$ is a literal, $\mathit{var}(x)$ denotes the variable from which $x$ is derived.

• $\mathit{mux}(a, b, c)$ denotes the Boolean function $a \cdot b + \bar{a} \cdot c$.

• $\mathit{form}(F)$ is a boolean predicate that is true if and only if $F$ is a sum or a cube. When
$\mathit{form}(F)$ is true, $\phi_F$ and $X_F$ will denote respectively a Boolean constant and a set of literals
that are such that $F = \phi_F \oplus s_{X_F}$.
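As a small worked instance of this notation (the functions $G$, $H$, $K$ and the choice of which literal is negated are illustrative assumptions, not taken from the paper): a sum has $\phi = 0$, a cube has $\phi = 1$, and a function that is neither fails the predicate. The cofactor $F_v$ used in the rest of this section is taken in its usual sense, $F$ with the literal $v$ forced to true.

```latex
\begin{align*}
X &= \{x_1, \bar{x}_3, x_5\} \\
s_X &= x_1 + \bar{x}_3 + x_5 \\
\bar{s}_X &= \bar{x}_1 \, x_3 \, \bar{x}_5 \\
\mathit{var}(X) &= \{x_1, x_3, x_5\} \\
G = x_1 + \bar{x}_3 + x_5 &\;\Rightarrow\; \mathit{form}(G) \text{ true},\ \phi_G = 0,\ X_G = X \\
H = \bar{x}_1 \, x_3 \, \bar{x}_5 &\;\Rightarrow\; \mathit{form}(H) \text{ true},\ \phi_H = 1,\ X_H = X \\
K = x_1 x_3 + x_5 &\;\Rightarrow\; \mathit{form}(K) \text{ false}
\end{align*}
```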

4.2 Description of the Algorithm 

The matching algorithm itself is simple. The main problem is to prove that it detects exactly the 
functions of the form $F(x_1, \ldots, x_n) = \mathit{mux}(s_X,\ \phi_Y \oplus s_Y,\ \phi_Z \oplus s_Z)$. The difficulty resides in the
fact that we cannot suppose that $\mathit{var}(X)$, $\mathit{var}(Y)$ and $\mathit{var}(Z)$ are mutually disjoint without limiting
the expressive power of the decomposition.

Here is an example. Let $F(a, x, y, z_1, z_2) = a y + \bar{a}(x + z_1 z_2)$. As written, $F$ is not decomposed
in the required form. However $F$ can be rewritten as follows: $F(a, x, y, z_1, z_2) = \mathit{mux}(a + x,\ 
0 \oplus (\bar{a} + y),\ 1 \oplus (\bar{z}_1 + \bar{z}_2))$, which is an acceptable decomposition. $F$ does not have any such
decomposition for which the sets $\mathit{var}(X)$ and $\mathit{var}(Y)$ are disjoint.
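The rewriting in this example can be checked by brute force over all 32 input assignments. The sketch below assumes the example reads $F = ay + \bar{a}(x + z_1 z_2)$ with decomposition $\mathit{mux}(a + x,\ 0 \oplus (\bar{a} + y),\ 1 \oplus (\bar{z}_1 + \bar{z}_2))$; the function names are hypothetical.

```python
from itertools import product


def mux(a, b, c):
    """mux(a, b, c) = a*b + (not a)*c."""
    return (a and b) or ((not a) and c)


def original(a, x, y, z1, z2):
    """F = a*y + a'*(x + z1*z2), as written in the example."""
    return (a and y) or ((not a) and (x or (z1 and z2)))


def decomposed(a, x, y, z1, z2):
    """mux(a + x, 0 XOR (a' + y), 1 XOR (z1' + z2'))."""
    s_x = a or x
    f_y = False != ((not a) or y)             # phi_Y = 0: a plain sum
    f_z = True != ((not z1) or (not z2))      # phi_Z = 1: the cube z1*z2
    return mux(s_x, f_y, f_z)


def agree():
    """True if both expressions match on every assignment."""
    return all(bool(original(*v)) == bool(decomposed(*v))
               for v in product((False, True), repeat=5))


if __name__ == "__main__":
    print(agree())
```

Note that the variable $a$ appears both in $s_X$ and in $F_Y$, which is exactly the overlap between $\mathit{var}(X)$ and $\mathit{var}(Y)$ that the text says cannot be avoided here.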
Algorithm

Input: A Boolean function $F(x_1, \ldots, x_n)$.

Output: If it exists, a triplet $(F_X, F_Y, F_Z)$ of boolean functions such that $F(x_1, \ldots, x_n) =
\mathit{mux}(F_X, F_Y, F_Z)$ where $F_X$ is a sum of literals, $\mathit{form}(F_Y)$ is true and $\mathit{form}(F_Z)$ is
true.

1. Compute the cofactor $F_x$ for every literal $x$.

2. Group the literals by equivalence classes; two literals $x$ and $x'$ are considered to be equivalent
if $F_x = F_{x'}$.

3. For every equivalence class $X$ do:

(a) Let $x$ be any element of the equivalence class $X$. Let $F_Y$ be the Boolean function $F_x$.
If $\mathit{form}(F_Y)$ is not true, skip to the next equivalence class.

(b) If $\mathit{form}(F_{\bar{s}_X})$ is true, return the result $(s_X, F_Y, F_{\bar{s}_X})$.

(c) Compute the set $X'$ of literals $v$ satisfying the following two properties: $\mathit{var}(v) \in
\mathit{var}(F_Y)$ and $F_v = (F_Y)_v$.

(d) If $\mathit{form}(F_{\bar{s}_{X \cup X'}})$ is true, then return the result $(s_{X \cup X'}, F_Y, F_{\bar{s}_{X \cup X'}})$.

4. For every literal $x$ do:

(a) If $\mathit{form}(F_x)$ is not true, skip to the next literal.

(b) Repeat steps 3c and 3d with $X = \emptyset$ and $F_Y = \phi_{F_x} \oplus (\bar{x} + s_{X_{F_x}})$.

5. If all literals and equivalence classes have been processed without finding a solution, $F$ cannot
be decomposed as desired.
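The steps above can be sketched on explicit truth tables instead of BDDs. This is a minimal illustration under stated assumptions, not the authors' implementation: functions are tuples of 2^n booleans, a literal is a pair `(variable index, polarity)`, literals are restricted to the support of $F$, and the consistency guard in `extend` is an addition of this sketch. All function names are hypothetical.

```python
from itertools import product


def table(f, n):
    """Truth table of f over n variables, first variable most significant."""
    return tuple(bool(f(*bits)) for bits in product((0, 1), repeat=n))


def cofactor(t, n, lit):
    """Cofactor by a literal (i, p): force variable i to value p."""
    i, p = lit
    bit = 1 << (n - 1 - i)
    return tuple(t[(k | bit) if p else (k & ~bit)] for k in range(1 << n))


def sum_of(X, n):
    """Truth table of s_X, the disjunction of the literals in X."""
    return tuple(any(((k >> (n - 1 - i)) & 1) == p for i, p in X)
                 for k in range(1 << n))


def support(t, n):
    return {i for i in range(n)
            if cofactor(t, n, (i, 0)) != cofactor(t, n, (i, 1))}


def form(t, n):
    """Return (phi, X) with t == phi XOR s_X, or None if t is no sum or cube."""
    for phi in (0, 1):
        g = tuple(b != bool(phi) for b in t)
        if not any(g):
            return phi, frozenset()            # constant function
        if all(g):
            continue                           # handled by the other phase
        X = frozenset((i, p) for i in range(n) for p in (1, 0)
                      if all(cofactor(g, n, (i, p))))
        if sum_of(X, n) == g:
            return phi, X
    return None


def not_s(t, n, X):
    """Cofactor by the cube 'not s_X': every literal of X forced false."""
    for i, p in X:
        t = cofactor(t, n, (i, 1 - p))
    return t


def match(t, n):
    """Return (X, F_Y, F_Z) with t == mux(s_X, F_Y, F_Z), or None."""
    lits = [(i, p) for i in sorted(support(t, n)) for p in (1, 0)]
    cof = {v: cofactor(t, n, v) for v in lits}             # step 1
    classes = {}
    for v in lits:                                         # step 2
        classes.setdefault(cof[v], []).append(v)

    def extend(X, fy):                                     # steps 3c and 3d
        sup = support(fy, n)
        X0 = set(X) | {v for v in lits
                       if v[0] in sup and cof[v] == cofactor(fy, n, v)}
        if len({i for i, _ in X0}) < len(X0):              # guard, not in the paper
            return None
        fz = not_s(t, n, X0)
        return (frozenset(X0), fy, fz) if form(fz, n) else None

    for X in classes.values():                             # step 3
        fy = cof[X[0]]
        if form(fy, n) is None:                            # step 3a
            continue
        fz = not_s(t, n, X)
        if form(fz, n):                                    # step 3b
            return frozenset(X), fy, fz
        found = extend(X, fy)
        if found:
            return found
    for x in lits:                                         # step 4
        fx = form(cof[x], n)
        if fx is None:
            continue
        phi, Xf = fx
        s = sum_of(Xf | {(x[0], 1 - x[1])}, n)             # not-x + s_{X_{F_x}}
        found = extend((), tuple(b != bool(phi) for b in s))
        if found:
            return found
    return None                                            # step 5
```

On the example of Section 4.2, with variables ordered $(a, x, y, z_1, z_2)$, `match` skips the class of $a$, fails step 3b on the class of $x$, and succeeds in step 3d with $X \cup X' = \{x, a\}$. Truth tables grow as 2^n, so this form is only suitable for small experiments; the paper relies on BDDs for efficiency.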

4.3 Time Complexity 

The cofactor of a BDD by a literal has time complexity $O(N)$ where $N$ is the number of BDD
nodes. The most expensive step of the algorithm is step 3c, which may require up to $O(n^2)$
cofactor computations in total. The worst-case complexity of the algorithm is thus $O(n^2 \times N)$.
We do not expect the worst-case complexity to be attained often. Equivalence classes and the test
$\mathit{var}(v) \in \mathit{var}(F_Y)$ act as a filter, reducing the term $O(n^2)$. Moreover $F_Y$ has $|X|$ fewer variables
than $F$ and is likely to have a smaller BDD representation.
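A rough count that is consistent with this bound, under the assumptions that $F$ depends on $n$ variables (hence $2n$ literals), that the cofactors $F_v$ from step 1 are cached, and that the cost of the $\mathit{form}$ tests and of the cube cofactors is ignored:

```latex
\begin{align*}
\text{step 1:}  &\quad 2n \text{ cofactors of } F \\
\text{step 3c:} &\quad \underbrace{2n}_{\text{classes, at most}} \times
                   \underbrace{2n}_{\text{candidate literals } v}
                   = 4n^2 \text{ cofactors } (F_Y)_v \\
\text{step 4b:} &\quad 2n \times 2n = 4n^2 \text{ cofactors } (F_Y)_v \\
\text{total:}   &\quad O(n^2) \text{ cofactors} \times O(N) \text{ per cofactor}
                   = O(n^2 \times N)
\end{align*}
```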

4.4 Proof of Correctness 

Lemma 4.1 Let $F$ and $F_Y$ be two Boolean functions and $X_0$ a set of literals such that for each $v$ in
$X_0$ we have $F_v = (F_Y)_v$. Then $F = \mathit{mux}(s_{X_0}, F_Y, F_{\bar{s}_{X_0}})$.

Proof Let $G = \mathit{mux}(s_{X_0}, F_Y, F_{\bar{s}_{X_0}})$. We only need to prove that $F = G$ when $s_{X_0} = 1$. Let $v$
be a literal in $X_0$. By hypothesis, we have: $F_v = (F_Y)_v$. On the other hand by definition of $G$ we
have $G_v = (F_Y)_v$. Thus for every literal in $X_0$ we have $F_v = G_v$ which proves that $F = G$. ■
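The lemma can also be checked exhaustively for very small $n$. The sketch below enumerates every pair of two-variable functions and every consistent literal set, and asserts the identity whenever the hypothesis holds. It is a sanity check under the reading $F_v = (F_Y)_v$ of the hypothesis, not a substitute for the proof; all names are hypothetical.

```python
from itertools import product

N = 2                                     # number of variables, kept tiny on purpose
ROWS = list(product((0, 1), repeat=N))    # all input assignments


def cofactor(t, lit):
    """Cofactor of truth table t by a literal (i, p): force variable i to p."""
    i, p = lit
    return tuple(t[ROWS.index(r[:i] + (p,) + r[i + 1:])] for r in ROWS)


def s(X):
    """Truth table of s_X, the disjunction of the literals in X."""
    return tuple(int(any(r[i] == p for i, p in X)) for r in ROWS)


def cofactor_by_not_s(t, X):
    """Cofactor of t by the cube 'not s_X' (every literal of X forced false)."""
    for i, p in X:
        t = cofactor(t, (i, 1 - p))
    return t


def mux(a, b, c):
    """Pointwise mux on truth tables."""
    return tuple(y if sel else z for sel, y, z in zip(a, b, c))


def check_lemma():
    """Return how many (F, F_Y, X_0) triples satisfied the hypothesis."""
    checked = 0
    tables = list(product((0, 1), repeat=len(ROWS)))
    # every literal set that never holds a variable together with its negation
    literal_sets = [[(i, p) for i, p in enumerate(choice) if p is not None]
                    for choice in product((None, 0, 1), repeat=N)]
    for F in tables:
        for FY in tables:
            for X0 in literal_sets:
                if all(cofactor(F, v) == cofactor(FY, v) for v in X0):
                    assert F == mux(s(X0), FY, cofactor_by_not_s(F, X0))
                    checked += 1
    return checked


if __name__ == "__main__":
    print(check_lemma(), "cases satisfied the hypothesis")
```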

Theorem 4.2 Every solution found by the algorithm is a valid decomposition of F. 

Proof In step 3b, 3d and 4b of the algorithm, the pairs $(F_Y, X_0 = X)$, $(F_Y, X_0 = X \cup X')$ and
$(F_Y, X_0 = X \cup X')$ satisfy the hypothesis of lemma 4.1. Moreover the decompositions are returned
by the algorithm only if $\mathit{form}(F_Y)$ and $\mathit{form}(F_{\bar{s}_{X_0}})$ are both true. Thus the algorithm only returns
valid decompositions of $F$. ■

Lemma 4.3 Let $F$ be such that $F = \mathit{mux}(s_X, F_Y, F_Z)$ and $\mathit{form}(F_Z)$ true. Let $X_0$ be a set
of literals containing $X$ and such that for every literal $v$ in $X_0$ we have $F_v = (F_Y)_v$. Then
$F = \mathit{mux}(s_{X_0}, F_Y, F_{\bar{s}_{X_0}})$ and $\mathit{form}(F_{\bar{s}_{X_0}})$ is true.

Proof Lemma 4.1 implies that $F = \mathit{mux}(s_{X_0}, F_Y, F_{\bar{s}_{X_0}})$. Since $X_0$ contains $X$ we have $F_{\bar{s}_{X_0}} =
(F_Z)_{\bar{s}_{X_0}}$. By hypothesis $F_Z$ is a cube or a sum; therefore the cofactor of $F_Z$ by the cube $\bar{s}_{X_0}$ is also
a cube or a sum, which proves that $\mathit{form}(F_{\bar{s}_{X_0}})$ is true. ■

Lemma 4.4 Let $F$ be such that $F = \mathit{mux}(s_X, F_Y, F_Z)$. Let $X_1 = \{v \in X, \mathit{var}(v) \notin \mathit{var}(F_Y)\}$
and $X_2 = \{v \in X, \mathit{var}(v) \in \mathit{var}(F_Y)\}$. Then the following assertions hold:

(i) if $X_1 \neq \emptyset$, $X_1$ is contained in a unique equivalence class $X_{eq}$.

(ii) if $X' = \{v, \mathit{var}(v) \in \mathit{var}(F_Y) \text{ and } F_v = (F_Y)_v\}$ as in step 3c of the algorithm, then
$X_2 \subseteq X'$.

Proof (i) Let $v$ be an element of $X_1$. Since $\mathit{var}(v)$ does not belong to $\mathit{var}(F_Y)$, we have
$F_v = (F_Y)_v = F_Y$. Thus all cofactors of $F$ by elements of $X_1$ are equal to the same function, $F_Y$,
which proves that $X_1$ is a subset of an equivalence class. If $X_1$ is not empty, then this equivalence
class is unique.

(ii) By definition, every element $v$ of $X_2$ is such that $\mathit{var}(v) \in \mathit{var}(F_Y)$. Moreover since
$X_2 \subseteq X$ we have $F_v = (F_Y)_v$ which proves that $v$ belongs to $X'$. ■

Lemma 4.5 Let $F = \mathit{mux}(s_X, F_Y, F_Z)$ be such that $\mathit{form}(F_Y)$ and $\mathit{form}(F_Z)$ are true. Let
$X_1$, $X_2$ and $X'$ be as in lemma 4.4. We suppose that $X_1 \neq \emptyset$. Let $X_{eq}$ be the equivalence class
containing $X_1$ and $x$ an element of $X_1$. Then $F_Y = F_x$ and when the algorithm is processing the
equivalence class $X_{eq}$:

• if $X_2 = \emptyset$ the algorithm returns the valid decomposition $\mathit{mux}(s_{X_{eq}}, F_x, F_{\bar{s}_{X_{eq}}})$ in step 3b.

• if $X_2 \neq \emptyset$ the algorithm returns the valid decomposition $\mathit{mux}(s_{X_{eq} \cup X'}, F_x, F_{\bar{s}_{X_{eq} \cup X'}})$ in step
3d.

Proof If $X_2 = \emptyset$, apply lemma 4.1 with $F_Y = F_x$ and $X_0 = X_{eq}$. If $X_2 \neq \emptyset$, apply lemma 4.1
with $F_Y = F_x$ and $X_0 = X_{eq} \cup X'$. Lemma 4.4 shows that in both cases $X \subseteq X_0$. Lemma 4.3
concludes the proof. ■

Lemma 4.6 Let $F = \mathit{mux}(s_X, F_Y, F_Z)$ be such that $\mathit{form}(F_Y)$ and $\mathit{form}(F_Z)$ are true. Let
$X_1$, $X_2$ and $X'$ be as in lemma 4.4. We suppose that $X_1 = \emptyset$. If one exists, let $x$ be an element
of $X_2$ appearing in negated form in $X_{F_Y}$. Then $F_Y = \phi_{F_x} \oplus (\bar{x} +
s_{X_{F_x}})$ and when the algorithm processes the literal $x$ in step 4b it returns the valid decomposition
$\mathit{mux}(s_{X'},\ \phi_{F_x} \oplus (\bar{x} + s_{X_{F_x}}),\ F_{\bar{s}_{X'}})$.

Proof We have $F_x = (F_Y)_x$. Since $\mathit{form}(F_Y)$ is true, $\mathit{form}(F_x)$ is also true. Thus $(F_Y)_x =
F_x = \phi_{F_x} \oplus s_{X_{F_x}}$. By hypothesis, $x$ appears in negated form in $s_{X_{F_Y}}$, thus $F_Y = \phi_{F_x} \oplus (\bar{x} + s_{X_{F_x}})$.
From lemma 4.1, we deduce that $F = \mathit{mux}(s_{X'}, F_Y, F_{\bar{s}_{X'}})$. Since $X_1 = \emptyset$, we have $X = X_2 \subseteq X'$.
We conclude with lemma 4.3 that $\mathit{form}(F_{\bar{s}_{X'}})$ is true. ■

Theorem 4.7 If a solution exists, it is found and returned. 

Proof The only case not covered by the lemmas 4.5 and 4.6 is the case where $X_1 = \emptyset$ and all
literals in $X_2$ appear in $F_Y$ unnegated. However in that case $F_Y = \phi \oplus (s_X + s_Y)$ for some set of
literals $Y$, which means that $F_Y$ can be replaced by the constant $\bar{\phi}$ and $F = \mathit{mux}(s_X, \bar{\phi}, F_Z)$. This
case is handled by step 3b of the algorithm. ■

5 Experimental Results 

We have chosen three blocks of control equations from our current processor design. 
Characteristics of these three blocks are shown in Table 1. The first block is the major 
control block in our design, controlling the integer and floating point datapaths. The other
two are small blocks typical of the rest of the design.

Table 2 shows the results of mapping. Most gates are small, with fanins of 5 or less. 
Much of this is the result of small equations in the original design specification. Many 
equations are simple, in that they combine only a few variables or just pass data from one 
pipe stage to the next. Timing verification has shown us, however, that the critical path of 
the design contains more complex logic, so it is important that our mapping algorithm find 
good solutions. It is worth noting that the maximum fanin of the gates produced by our 
algorithm is relatively high.
circuit     # eqns   # lits   lit/eqn
Control       2319     9834      4.2
FPACtl          71      141      1.99
FPDivCtl        43       86      2.0

Table 1: Examples Used

circuit     # gates   gate/eqn   fanin   max fanin
Control        3534       1.53    2.89          22
FPACtl           71       1.00    2.06           8
FPDivCtl         43       1.00    2.05           5

Table 2: Performance of Full Algorithm

circuit     # gates   gate/eqn   fanin   max fanin   time
Control        3575       1.54    2.87          22   0.98
FPACtl           73       1.03    2.01           8   1.04
FPDivCtl         43       1.00    2.05           5   1.02

Table 3: Eliminating Steps 3d and 4

circuit     # gates   gate/eqn   fanin   max fanin   time
Control        5773       2.50    2.16           5   1.15
FPACtl           92       1.30    1.80           5   1.50
FPDivCtl         43       1.00    2.05           5   0.99

Table 4: Limited Mapping

# eqns:    number of equations in the circuit
# lits:    number of uses of literals in factored form
lit/eqn:   average number of literals per equation
# gates:   number of gates produced by the algorithm
gate/eqn:  average number of gates per equation
fanin:     average gate fanin
max fanin: maximum gate fanin
time:      ratio of runtime to full algorithm






We experimented with two variants of the algorithm, based on observation of the 
algorithm's behavior on our three examples. We observed that 98.6% of the solutions found 
were found in step 3b of the algorithm, and all the remaining solutions were found in step 
3d. Step 4 was not needed for any of our examples, although we can construct artificial 
examples that do require step 4. 

Based upon this data, we tried eliminating steps 3c, 3d and 4 from our algorithm. Table 
3 shows the results. CPU time is essentially unchanged, while the number of gates needed 
has gone up slightly. It is important to note that when the algorithm misses a match, the 
algorithm will be called again on a smaller piece of logic. Thus, matching is tried over and 
over on different pieces until a match is found. A faster algorithm that misses matches can 
actually result in longer run times and poorer results. 

Another variant we tried was to limit the mapped functions to $F = \mathit{mux}(F_X, F_Y, F_Z)$,
where $F_X$, $F_Y$, and $F_Z$ have disjoint support sets. In this case, we know that the function's
variables partition into no more than six equivalence classes. We combine steps 1 and 2 of 
the algorithm, computing equivalence classes as we go. We can terminate our computation 
early if more than 6 are found. In addition, in step 3 we pick the equivalence class with 
the most members, rather than iterating over all equivalence classes. This results in an 
algorithm that terminates early on complicated cases. Table 4 shows that this is a poor 
choice. The algorithm misses so many matches that the total runtime increases and the 
results become poorer. 

6 Conclusion and Future Work 

We have demonstrated a technology mapper for full-custom ECL gates. The technology 
mapper takes advantage of the high fanin and regular structure of these gates to implement 
each equation using a small number of gates. The algorithm proceeds using efficient 
operations on BDDs, producing a mapping in an acceptable amount of time. 

Not described in this paper are electrical optimizations on the gates produced. These 
optimizations include separate power sizing of the logic portion of the gate and the output 
driver. This power sizing is done by starting with low power everywhere, and then walking 
over the graph of gates and wires increasing the power along the critical path. Trial 
placements of the gates are done in order to estimate capacitance, which is taken into 
account during the power adjustment phase. Additional optimizations change voltage
swings and convert signals to differential pairs when it is possible given the choice of 
gates. Certain gates, such as those with a large number of OR terms at the bottom, can be 
best implemented using a slightly different circuit called a level-shifting-OR, so we have a 
pattern matching and replacement phase to take care of this and similar optimizations. 

Future extensions to the mapper described here could look at a number of factors. The 
delay through an ECL gate is not the same for each input, so it makes a difference to which 
input a variable is assigned. In order to take this into account in our algorithm, we would 
have to have available the arrival times of the individual inputs and a model for gate delay. 